Skip to content

Conversation

@neils-dev
Copy link
Contributor

@neils-dev neils-dev commented Apr 13, 2023

What changes were proposed in this pull request?

ScmDecommissioning scm admin client command support. Includes removing scm from ratis ring, error handling. Includes client and server side unit tests.

ozone admin scm decommission --nodeid

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-8365

How was this patch tested?

Unit tests: manual command.

bash-4.2$ ozone admin scm decommission --nodeid=b5e33c32-2b5c-4905-99d8-dd4f1a19206c
Decommissioned Scm b5e33c32-2b5c-4905-99d8-dd4f1a19206c
bash-4.2$ ozone admin scm roles
scm1:9894:LEADER:a42a1405-bf50-403c-8db2-ee7eb2c49d64:172.25.0.9
scm2:9894:FOLLOWER:cb8676de-556e-41cb-a074-a2a2c0d0caf3:172.25.0.2

…g scm from ratis ring, error handling. Includes client and server side unit tests. TODO - scm revocation for decommissioning.'
@neils-dev neils-dev added the gr label Apr 13, 2023
@errose28
Copy link
Contributor

We should keep the CLI consistent with the existing decom CLIs. DN and OM uses ozone admin <component> decommission and the service ID is passed with a flag like --service-id instead of as an argument.

…ent with existing decommissioning commands, ozone admin <node type> decommission, and minor changes to parameters to be consistent with decommissioning commands.
@neils-dev
Copy link
Contributor Author

neils-dev commented Apr 15, 2023

We should keep the CLI consistent with the existing decom CLIs. DN and OM uses ozone admin <component> decommission and the service ID is passed with a flag like --service-id instead of as an argument.

Thanks @errose28 . The argument for this scm decommission command <clusterid> is expecting the cluster UUID, similar to CID-12f06022-02e0-46d8-9fdd-55c03f6cf1d9 . I don't believe the service-id is the same here (can be used interchangeably).

Pushed changes to make decommission commands consistent as suggested: ozone admin scm decommission -clusterid -nodeid

@errose28
Copy link
Contributor

@neils-dev Why do we need the cluster ID to decommission SCM? The SCM receiving the command already knows its own cluster ID and cannot make changes if there is a mismatch.

@neils-dev
Copy link
Contributor Author

neils-dev commented Apr 17, 2023

@neils-dev Why do we need the cluster ID to decommission SCM? The SCM receiving the command already knows its own cluster ID and cannot make changes if there is a mismatch.

Thanks for following up. The scm request for removing the scm, removeSCM, requires the clusterId from the scm client. On server side, the SCMHAManagerImpl, validates the request checking if the clusterId in the request matches the scm clusterId. see:

if (!request.getClusterId().equals(scm.getClusterId())) {

On clusterId mismatch, an exception is thrown that is propagated back to the admin cli user specifying the clusterId expected for the given scm.

@errose28
Copy link
Contributor

errose28 commented Apr 17, 2023

@nandakumar131 Why was cluster ID validation for SCM removal added in #4358 ? None of the other decommission requests, or any requests that I can think of, require the user to specify the cluster ID. Cluster ID is internally generated by Ozone and abstracted from users, who refer to the cluster by service ID. The only way for a user to learn the cluster ID is by manually checking VERSION files or startup log messages AFAIK.

@neils-dev
Copy link
Contributor Author

Cluster ID is internally generated by Ozone and abstracted from users

Thanks. Note that the clusterId can be avail to the user with the SCM web UI. It can provide the scmid and clusterid for the scm.

@nandakumar131
Copy link
Contributor

nandakumar131 commented Apr 18, 2023

Why was cluster ID validation for SCM removal added in #4358 ? None of the other decommission requests, or any requests that I can think of, require the user to specify the cluster ID. Cluster ID is internally generated by Ozone and abstracted from users, who refer to the cluster by service ID.

@errose28 you're right. The Cluster ID is internal, and we should not ask the user to provide the Cluster ID. In #4358 the Cluster ID is used internally for Ratis group removal as Cluster ID is used as Ratis Group ID.

The validation that is done in SCMHAManagerImpl#removeSCM is internal and should never fail for the SCM decommissioning command.

@neils-dev we should not get the Cluster ID from the user.

@neils-dev
Copy link
Contributor Author

Thanks @nandakumar131 , @errose28 . Just had an offline discussion with Nanda. Going to modify the scm client to scm client protocol server to simply require the nodeid to be decommissioned. Server side changes will then be made to the removePeerFromHARing to perform internal checking (clusterid, ratisaddr) and to decommission node. Thanks.

…quire the scmid (nodeid) for decommissioning an SCM. Previously required ratisaddress and clusterid now will be handled on handling and validation of scm decommission command server side, opened HDDS-8452.
@neils-dev
Copy link
Contributor Author

Updated scm decommissioning command to remove required clusterId. Command issued now with scmid as --nodeid containing UUID of scm to decommission.

bash-4.2$ ozone admin scm decommission --nodeid=b4ddaeaf-ab14-4570-94c0-416cd2505aa6
Decommissioned Scm b4ddaeaf-ab14-4570-94c0-416cd2505aa6
bash-4.2$ ozone admin scm roles
scm1:9894:LEADER:b903a2e3-1106-41e3-a10f-b4b97788601d:172.20.0.9
scm2:9894:FOLLOWER:61e8dadf-96f6-4ea1-9f31-d4722595ea2a:172.20.0.6

@neils-dev neils-dev marked this pull request as ready for review April 25, 2023 15:04
@neils-dev
Copy link
Contributor Author

Dependency pr supporting the admin scm decommissioning command merged and closed:
HDDS-8452. Scm server side RPC support for Scm Decommissioning

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add some more tests here to verify other scenarios as well?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@neils-dev, thanks for working on this.
Please create a follow-up Jira to add more unit-tests.

@nandakumar131 nandakumar131 merged commit 977a271 into apache:master May 3, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants